The dataset for crime in Seattle in 2015 contains 2050 instances in which the location coordinates are recorded as zero (i.e., a latitude and longitude of zero). As this location would be in the Atlantic Ocean, I will treat these instances as having missing coordinate information. Because I will be using the coordinate data to plot recorded instances of crime on a map of Seattle, and thus get an idea of how certain types of crime are distributed across the city, careful examination of these missing data is warranted. The missingness may also reflect differences in the manner in which crime is recorded. Does it vary with police district or with the particular type of crime?
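As a quick sanity check on what counts as "missing", the tally of zero coordinates could be sketched as follows. This is only an illustration on a small made-up data frame (`toy` is hypothetical; only the column names `Latitude` and `Longitude` come from the dataset). It also checks whether the two coordinates are always zero together.

```r
# Hypothetical toy data standing in for the Seattle incident table;
# only the Latitude and Longitude column names are taken from the dataset.
toy <- data.frame(
  Latitude  = c(47.61, 0, 47.66, 0, 47.55),
  Longitude = c(-122.33, 0, -122.31, 0, -122.29)
)

# Flag rows whose coordinates are the zero placeholder.
zero_lat <- toy$Latitude == 0
zero_lon <- toy$Longitude == 0

n_missing  <- sum(zero_lat & zero_lon)   # rows with both coordinates zero
n_mismatch <- sum(zero_lat != zero_lon)  # rows where only one coordinate is zero

n_missing
n_mismatch
```

If `n_mismatch` is zero on the real data, then flagging missingness by `Latitude == 0` alone and filtering on both coordinates describe the same set of rows.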
library(tidyverse)
## ── Attaching packages ────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1.9000 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.7.2 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ───────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
sanfrancisco <- read.csv("sanfrancisco_incidents_summer_2014.csv")
seattle <- read.csv("seattle_incidents_summer_2014.csv")
#To avoid repeating code when generating the various conditional distributions, I
#defined a function that produces both the data frame of tallies and the interactive plot for
#examining possible relationships between missingness and other factors.
calculate_conditional_distribution <- function(column) {
  column <- enquo(column)
  zero_assess <- seattle %>%
    mutate(Position_Data_Missing = Latitude == 0) %>%
    group_by(Position_Data_Missing, !!column) %>%
    summarize(m = n()) %>%
    group_by(!!column) %>%
    mutate(n = sum(m)) %>%
    ungroup() %>%
    mutate(sum_total = sum(m)) %>%
    group_by(Position_Data_Missing) %>%
    mutate(sum_zero_test = sum(m)) %>%
    mutate(total_dist = n / sum_total) %>%
    mutate(conditional_distribution = m / sum_zero_test) %>%
    ungroup() %>%
    group_by(!!column)
  #initialize list of items to be returned (a d.f. and a ggplot)
  return_list <- list()
  return_list[["df"]] <- zero_assess
  #create visualizations for assessment of independence
  plot <- zero_assess %>%
    filter(n > 10) %>%
    mutate(Position_Data_Missing =
             ifelse(Position_Data_Missing == TRUE,
                    "missing", "not missing")) %>%
    ggplot() +
    geom_col(aes_(x = column, y = ~conditional_distribution,
                  fill = ~Position_Data_Missing)) +
    labs(x = str_c(quo_name(column),
                   "\n(move cursor over bars to see the categories)"),
         y = "conditional distribution") +
    ggtitle("Assessing the Effect of Missingness of Position Data") +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5),
          axis.text.x = element_blank(),
          legend.title = element_blank()) +
    guides(fill = guide_legend(reverse = TRUE))
  return_list[["ggplot"]] <- plot
  interactive_plot <-
    ggplotly(plot) %>%
    layout(margin = list(b = 50, l = 60, r = 10, t = 80))
  return_list[["plotly"]] <- interactive_plot
  return(return_list)
}
We thus find the distributions of certain general categories of crime, as indicated by the variable Summarized.Offense.Description in the data, conditional on the coordinate data being missing and on its being present. If missingness is simply random (in this case, not associated with the type of crime reflected in the “summarized offense description”), we would expect these two conditional distributions to look roughly the same.
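To make the idea of a conditional distribution concrete, here is a minimal base R sketch on made-up data (the data frame `toy` and its values are hypothetical, not taken from the Seattle file). It computes the distribution of offense category separately for rows with and without coordinates.

```r
# Hypothetical toy data: offense category and whether coordinates are missing.
toy <- data.frame(
  offense = c("THEFT", "THEFT", "THEFT", "THEFT",
              "ASSAULT", "ASSAULT", "ASSAULT", "ASSAULT"),
  missing = c(TRUE, FALSE, FALSE, FALSE,
              TRUE, FALSE, FALSE, FALSE)
)

# Cross-tabulate missingness (rows) against offense category (columns).
counts <- table(toy$missing, toy$offense)

# Divide each row by its total: each row is now the distribution of
# offense category conditional on that missingness status.
cond <- prop.table(counts, margin = 1)
cond
```

In this toy table the `TRUE` and `FALSE` rows are identical, which is what missingness unrelated to offense category would look like. Rows that differ markedly would suggest an association.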
results <- calculate_conditional_distribution(Summarized.Offense.Description)
results[["plotly"]]
In the plot above, for the types of crime with large numbers of instances, the length of each red bar is roughly equal to the length of the green bar below it. We thus find little evidence of a systematic relationship between missingness of coordinate data and the type of reported crime. However, moving the cursor over the bars, we see that prostitution appears to be a notable exception, with perhaps all of its cases recorded with coordinate data. Examination of the data frame generated with the above code shows that there were 202 cases of prostitution during the year (from 12/29/2014 through 1/2/2015), and that indeed all of the recorded cases had coordinate data.
Are there other types of crime for which this is the case? Let’s see.
#A simple manipulation of the data derived in calculate_conditional_distribution suffices.
results <- calculate_conditional_distribution(Summarized.Offense.Description)
library(knitr)
kable(results[["df"]] %>% filter(m == n))
| Position_Data_Missing | Summarized.Offense.Description | m | n | sum_total | sum_zero_test | total_dist | conditional_distribution |
|---|---|---|---|---|---|---|---|
| FALSE | [INC - CASE DC USE ONLY] | 5 | 5 | 32779 | 30729 | 0.0001525 | 0.0001627 |
| FALSE | DISORDERLY CONDUCT | 2 | 2 | 32779 | 30729 | 0.0000610 | 0.0000651 |
| FALSE | DUI | 34 | 34 | 32779 | 30729 | 0.0010372 | 0.0011064 |
| FALSE | ELUDING | 8 | 8 | 32779 | 30729 | 0.0002441 | 0.0002603 |
| FALSE | ESCAPE | 3 | 3 | 32779 | 30729 | 0.0000915 | 0.0000976 |
| FALSE | HOMICIDE | 8 | 8 | 32779 | 30729 | 0.0002441 | 0.0002603 |
| FALSE | PORNOGRAPHY | 3 | 3 | 32779 | 30729 | 0.0000915 | 0.0000976 |
| FALSE | PROSTITUTION | 202 | 202 | 32779 | 30729 | 0.0061625 | 0.0065736 |
| FALSE | PUBLIC NUISANCE | 4 | 4 | 32779 | 30729 | 0.0001220 | 0.0001302 |
The dataset includes, again, 2050 instances with missing coordinate information out of 32779 instances in total, roughly 6.3 percent of the data. If missingness were distributed at random at this overall rate, one would expect that, for some of the categories with very few instances in the dataset, all of the reported crimes within the category would be recorded with position data. However, given that prostitution has as many as 202 recorded instances, all recorded with position information, there may be a systematic relationship between this value of Summarized.Offense.Description and missingness. For example, although we can’t infer that this is the case (a chi-squared test may be useful here), perhaps record-keeping in the case of prostitution is more carefully handled than it is for most other offenses.
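The argument above can be put into numbers. The sketch below uses only the counts reported in the table and text. It assumes, purely for illustration, that each record is missing its coordinates independently at the overall rate. It also runs a chi-squared test on the 2 x 2 table of prostitution versus all other offenses. The helper `p_all_present` is a hypothetical name introduced here.

```r
# Counts taken from the table and text above.
n_total        <- 32779  # all recorded incidents
n_missing      <- 2050   # incidents with zero coordinates
n_prostitution <- 202    # prostitution incidents, all with coordinates

# Overall rate at which coordinates are present.
p_present <- 1 - n_missing / n_total

# Chance that every incident in a category of size k has coordinates,
# assuming independent missingness at the overall rate (an assumption).
p_all_present <- function(k) p_present^k

p_all_present(5)               # a small category of 5 incidents
p_all_present(n_prostitution)  # a category as large as prostitution

# 2 x 2 table: prostitution versus all other offenses.
counts <- matrix(
  c(n_prostitution, 0,
    n_total - n_missing - n_prostitution, n_missing),
  nrow = 2, byrow = TRUE,
  dimnames = list(offense = c("PROSTITUTION", "OTHER"),
                  coordinates = c("present", "missing"))
)
test <- chisq.test(counts)
test$p.value
```

For a category of five incidents the chance of seeing no missing coordinates is roughly 0.72, so the small categories in the table are unremarkable. For 202 incidents the same chance is on the order of one in a few hundred thousand. The chi-squared p-value should be read with some caution, since prostitution was singled out after inspecting the data.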
We now turn to the relationship between district sector and missingness.
results <- calculate_conditional_distribution(District.Sector)
results[["plotly"]]
The mostly red bar at the left end corresponds to records for which no district sector is indicated, so it is not surprising that their coordinate data are also missing. Other than, perhaps, the mysterious District Sector 99 (with a total of only 40 reported instances), the data suggest that reporting was consistent across the sectors.
Similar considerations apply to the possible relationship between missingness and Zone Beat.
results <- calculate_conditional_distribution(Zone.Beat)
results[["plotly"]]
In this case, minor differences in the conditional distributions are evident, but they are likely not significant. (I may conduct a chi-squared test of this assumption.) We may thus reasonably suppose that a plot of the crime data on a map of the city would not distort the distribution of actual recorded crime across Seattle. I therefore plot all of the instances in the table that have position data (latitude and longitude) on a map of the city.
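Before moving on to the map, the chi-squared check mentioned in the parenthetical could be sketched along these lines. The helper `test_missingness` and the toy data are hypothetical and were not part of the original analysis. The `min_count` cutoff simply mirrors the `n > 10` filter used for the plots.

```r
# Hypothetical helper: tests whether missingness of coordinates is
# independent of a categorical column, after dropping sparse categories.
test_missingness <- function(df, column, min_count = 10) {
  missing  <- df$Latitude == 0
  category <- as.character(df[[column]])
  # Keep only categories with more than min_count records.
  keep <- category %in% names(which(table(category) > min_count))
  chisq.test(table(category[keep], missing[keep]))
}

# Toy data (hypothetical): three beats of 200 records each, with
# 12, 13, and 11 records lacking coordinates.
toy <- data.frame(
  Zone.Beat = rep(c("B1", "B2", "B3"), each = 200),
  Latitude  = c(rep(c(0, 47.6), times = c(12, 188)),
                rep(c(0, 47.6), times = c(13, 187)),
                rep(c(0, 47.6), times = c(11, 189)))
)

result <- test_missingness(toy, "Zone.Beat")
result$p.value
```

The missingness rates in the toy data are nearly equal, so the p-value is large and gives no reason to reject independence. On the real data the call would presumably be `test_missingness(seattle, "Zone.Beat")`. With many beats some expected counts may be small, in which case the `simulate.p.value = TRUE` option of `chisq.test` is worth considering.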
library(ggmap)
##
## Attaching package: 'ggmap'
## The following object is masked from 'package:plotly':
##
## wind
#obtain a suitable Seattle map
seattle_gg <- get_map("Seattle", maptype = "toner-lite",
                      zoom = 11)
## maptype = "toner-lite" is only available with source = "stamen".
## resetting to source = "stamen"...
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Seattle&zoom=11&size=640x640&scale=2&maptype=terrain&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Seattle&sensor=false
## Map from URL : http://tile.stamen.com/toner-lite/11/326/714.png
## Map from URL : http://tile.stamen.com/toner-lite/11/327/714.png
## Map from URL : http://tile.stamen.com/toner-lite/11/328/714.png
## Map from URL : http://tile.stamen.com/toner-lite/11/329/714.png
## Map from URL : http://tile.stamen.com/toner-lite/11/326/715.png
## Map from URL : http://tile.stamen.com/toner-lite/11/327/715.png
## Map from URL : http://tile.stamen.com/toner-lite/11/328/715.png
## Map from URL : http://tile.stamen.com/toner-lite/11/329/715.png
## Map from URL : http://tile.stamen.com/toner-lite/11/326/716.png
## Map from URL : http://tile.stamen.com/toner-lite/11/327/716.png
## Map from URL : http://tile.stamen.com/toner-lite/11/328/716.png
## Map from URL : http://tile.stamen.com/toner-lite/11/329/716.png
#remove zero latitude and longitude elements, for mapping
seattle_nonzero <- seattle %>%
  filter(Latitude != 0 & Longitude != 0)
#show the distribution of recorded reports of crime on the map, using a logarithmic scale to enhance
#the contrast
plot <- ggmap(seattle_gg, darken = c(.01, "black")) +
  geom_bin2d(data = seattle_nonzero,
             aes(x = Longitude, y = Latitude),
             bins = 200) +
  scale_fill_gradient(trans = "log10")
plot
We now focus on violent crimes, defined here as crimes with a “Summarized Offense Description” of Assault, Homicide, or Robbery.
library(ggmap)
Violent_Crimes <- c("ASSAULT", "HOMICIDE", "ROBBERY")
data <- seattle_nonzero %>%
  filter(is.element(Summarized.Offense.Description, Violent_Crimes))
ggmap(seattle_gg, darken = c(.01, "black")) +
  geom_bin2d(data = data,
             aes(x = Longitude, y = Latitude),
             bins = 200) +
  scale_fill_gradient(trans = "log10")
And the same map for narcotics offenses:
data <- seattle_nonzero %>%
  filter(Summarized.Offense.Description == "NARCOTICS")
ggmap(seattle_gg, darken = c(.01, "black")) +
  geom_bin2d(data = data,
             aes(x = Longitude, y = Latitude),
             bins = 200) +
  scale_fill_gradient(trans = "log10")